INTERSPEECH.2017 - Speech Processing

Total: 51

#1 A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation

Authors: Yannan Wang ; Jun Du ; Li-Rong Dai ; Chin-Hui Lee

In contrast to the conventional minimum mean squared error (MMSE) training criterion for nonlinear spectral mapping based on deep neural networks (DNNs), we propose a probabilistic learning framework to estimate the DNN parameters for single-channel speech separation. A statistical analysis of the prediction error vector at the DNN output reveals that it follows a unimodal density for each log power spectral component. By characterizing the prediction error vector as a multivariate Gaussian density with a zero mean vector and an unknown covariance matrix, we present a maximum likelihood (ML) approach to DNN parameter learning. Our experiments on the Speech Separation Challenge (SSC) corpus show that the proposed learning approach achieves better generalization and faster convergence than MMSE-based DNN learning. Furthermore, we demonstrate that the ML-trained DNN consistently outperforms the MMSE-trained DNN in all objective measures of speech quality and intelligibility in single-channel speech separation.
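
As a rough illustration of the difference between the two criteria, the numpy sketch below (not the authors' code) writes the MMSE loss next to a Gaussian negative log-likelihood with a learnable error variance, simplifying the covariance to a diagonal one that can be re-estimated in closed form.

```python
import numpy as np

def mmse_loss(pred, target):
    """Conventional MMSE criterion: mean squared prediction error."""
    return np.mean((pred - target) ** 2)

def ml_gaussian_loss(pred, target, sigma2):
    """Negative log-likelihood of the prediction error under a zero-mean
    Gaussian; sigma2 is a per-dimension error variance, a diagonal-covariance
    simplification of the model described in the abstract."""
    err = pred - target  # (frames, dims) log-power-spectral errors
    nll = 0.5 * (np.log(2.0 * np.pi * sigma2) + err ** 2 / sigma2)
    return np.mean(np.sum(nll, axis=1))

def update_error_variance(pred, target):
    """Closed-form ML re-estimate of the per-dimension error variance,
    to be alternated with the DNN parameter updates."""
    return np.mean((pred - target) ** 2, axis=0) + 1e-8
```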

#2 Deep Clustering-Based Beamforming for Separation with Unknown Number of Sources

Authors: Takuya Higuchi ; Keisuke Kinoshita ; Marc Delcroix ; Kateřina Žmolíková ; Tomohiro Nakatani

This paper extends a deep clustering algorithm for use with time-frequency masking-based beamforming and performs separation with an unknown number of sources. Deep clustering is a recently proposed single-channel source separation algorithm, which projects inputs into the embedding space and performs clustering in the embedding domain. In deep clustering, bi-directional long short-term memory (BLSTM) recurrent neural networks are trained to make embedding vectors orthogonal for different speakers and concurrent for the same speaker. Then, by clustering the embedding vectors at test time, we can estimate time-frequency masks for separation. In this paper, we extend the deep clustering algorithm to a multiple microphone setup and incorporate deep clustering-based time-frequency mask estimation into masking-based beamforming, which has been shown to be more effective than masking for automatic speech recognition. Moreover, we perform source counting by computing the rank of the covariance matrix of the embedding vectors. With our proposed approach, we can perform masking-based beamforming in a multiple-speaker case without knowing the number of speakers. Experimental results show that our proposed deep clustering-based beamformer achieves source separation performance comparable to that obtained with a complex Gaussian mixture model-based beamformer, which requires the number of sources in advance for mask estimation.
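
The source-counting step can be sketched with a few lines of numpy that inspect the eigenvalue spectrum of the embedding covariance matrix; the mean-centering and the 0.95 energy threshold are illustrative assumptions, not the paper's exact rank criterion.

```python
import numpy as np

def count_sources(embeddings, energy_threshold=0.95):
    """embeddings: (bins, dims) array of embedding vectors, one per TF bin.
    Returns an estimate of the number of sources from the effective rank
    of the embedding covariance matrix."""
    emb = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = emb.T @ emb / emb.shape[0]
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]          # descending
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, energy_threshold) + 1)
```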

#3 Time-Frequency Masking for Blind Source Separation with Preserved Spatial Cues

Authors: Shadi Pirhosseinloo ; Kostas Kokkinakis

In this paper, we address the problem of speech source separation by relying on time-frequency binary masks to segregate binaural mixtures. We describe an algorithm which can tackle reverberant mixtures and can extract the original sources while preserving their original spatial locations. The performance of the proposed algorithm is evaluated objectively and subjectively, by assessing the estimated interaural time differences versus their theoretical values and by testing for localization acuity in normal-hearing listeners for different spatial locations in a reverberant room. Experimental results indicate that the proposed algorithm is capable of preserving the spatial information of the recovered source signals while keeping the signal-to-distortion and signal-to-interference ratios high.

#4 Variational Recurrent Neural Networks for Speech Separation

Authors: Jen-Tzung Chien ; Kuan-Ting Kuo

We present a new stochastic learning machine for speech separation based on the variational recurrent neural network (VRNN). This VRNN is constructed from the perspectives of generative stochastic networks and variational auto-encoders. The idea is to faithfully characterize the randomness of the hidden state of a recurrent neural network through variational learning. The neural parameters under this latent variable model are estimated by maximizing the variational lower bound of the log marginal likelihood. An inference network driven by the variational distribution is trained from a set of mixed signals and the associated source targets. A novel supervised VRNN is developed for speech separation. The proposed VRNN provides a stochastic point of view which accommodates the uncertainty in hidden states and facilitates the analysis of model construction. The masking function is further employed in network outputs for speech separation. The benefit of using the VRNN is demonstrated by experiments on monaural speech separation.

#5 Detecting Overlapped Speech on Short Timeframes Using Deep Learning

Authors: Valentin Andrei ; Horia Cucu ; Corneliu Burileanu

The intent of this work is to demonstrate how deep learning techniques can be successfully used to detect overlapped speech on independent short timeframes. A secondary objective is to provide an understanding of how the duration of the signal frame influences the accuracy of the method. We trained a deep neural network with heterogeneous layers and obtained close to 80% inference accuracy on frames as short as 25 milliseconds. The proposed system provides higher detection quality than existing work and can predict overlapped speech with up to 3 simultaneous speakers. The method has low response latency and does not require a large amount of computing power.

#6 Ideal Ratio Mask Estimation Using Deep Neural Networks for Monaural Speech Segregation in Noisy Reverberant Conditions

Authors: Xu Li ; Junfeng Li ; Yonghong Yan

Monaural speech segregation is an important problem in robust speech processing and has been formulated as a supervised learning problem. In supervised learning methods, the ideal binary mask (IBM) is usually used as the target because of its simplicity and large speech intelligibility gains. Recently, the ideal ratio mask (IRM) has been found to improve speech quality over the IBM. However, the IRM was originally defined in anechoic conditions and did not consider the effect of reverberation. In this paper, the IRM is extended to reverberant conditions, where the direct sound and early reflections of the target speech are regarded as the desired signal. A deep neural network (DNN) is employed to estimate the extended IRM in noisy reverberant conditions. The estimated IRM is then applied to the noisy reverberant mixture for speech segregation. Experimental results show that the estimated IRM provides substantial improvements in speech intelligibility and speech quality over the unprocessed mixture signals under various noisy and reverberant conditions.
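
A minimal numpy sketch of the extended IRM, assuming magnitude spectrograms of the desired signal (direct sound plus early reflections) and of the residual (late reverberation plus noise) are available during training; the square-root compression (beta = 0.5) is a common convention rather than necessarily the authors' choice.

```python
import numpy as np

def extended_irm(desired_mag, residual_mag, beta=0.5):
    """desired_mag: |direct + early reflections|, residual_mag: |late reverb + noise|,
    both (frames, freqs) magnitude spectrograms.  Returns the ratio mask."""
    s2 = desired_mag ** 2
    n2 = residual_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta
```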

#7 Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Authors: Hiroyuki Miyoshi ; Yuki Saito ; Shinnosuke Takamichi ; Hiroshi Saruwatari

Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality, such as phonetic properties and speaking rate, contained in the posterior probabilities, because the source posterior probabilities are used directly to predict the target speech parameters. In this work, we assume that the training data partly include parallel speech data and propose sequence-to-sequence learning between the source and target posterior probabilities. The conversion models perform a non-linear and variable-length transformation from the source probability sequence to the target one. Further, we propose a joint training algorithm for the modules. In contrast to conventional VC, which separately trains the speech recognition module that estimates the posterior probabilities and the speech synthesis module that predicts the target speech parameters, our proposed method jointly trains these modules along with the proposed probability conversion modules. Experimental results demonstrate that our approach outperforms conventional VC.

#8 Learning Latent Representations for Speech Generation and Transformation

Authors: Wei-Ning Hsu ; Yu Zhang ; James Glass

An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images. In this paper, we apply a convolutional VAE to model the generative process of natural speech. We derive latent space arithmetic operations to disentangle learned latent representations. We demonstrate the capability of our model to modify the phonetic content or the speaker identity for speech segments using the derived operations, without the need for parallel supervisory data.
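
The latent space arithmetic can be illustrated with a small, hypothetical numpy helper: an attribute direction is taken as the difference of mean latent codes between two groups of segments, and a transformation adds that direction to a code before decoding. This is a hedged sketch of the general idea, not the operations derived in the paper.

```python
import numpy as np

def attribute_vector(latents_with, latents_without):
    """Difference of mean latent codes for segments with / without an attribute
    (e.g. a given phone or speaker); both inputs are (segments, latent_dim)."""
    return latents_with.mean(axis=0) - latents_without.mean(axis=0)

def transform(latent, attr_vec, scale=1.0):
    """Shift a latent code along the attribute direction before decoding."""
    return latent + scale * attr_vec
```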

#9 Parallel-Data-Free Many-to-Many Voice Conversion Based on DNN Integrated with Eigenspace Using a Non-Parallel Speech Corpus

Authors: Tetsuya Hashimoto ; Hidetsugu Uchida ; Daisuke Saito ; Nobuaki Minematsu

This paper proposes a novel approach to parallel-data-free, many-to-many voice conversion (VC). As one-to-one conversion offers limited flexibility, researchers have focused on many-to-many conversion, where speaker identity is often represented using speaker space bases. In this case, utterances of the same sentences have to be collected from many speakers. This study aims at overcoming this constraint to realize parallel-data-free, many-to-many conversion. This is made possible by integrating deep neural networks (DNNs) with an eigenspace built from a non-parallel speech corpus. In our previous study, many-to-many conversion was implemented using a DNN whose training was assisted by EVGMM conversion. By equivalently realizing the function of EVGMM through an eigenspace constructed from a non-parallel speech corpus, the desired conversion becomes possible. A key technique here is to estimate the covariance terms without parallel data between the source and target speakers. Experiments show that objective assessment scores are comparable to those of the baseline system trained with parallel data.

#10 Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks

Authors: Takuhiro Kaneko ; Hirokazu Kameoka ; Kaoru Hiramatsu ; Kunio Kashino

We propose a training framework for sequence-to-sequence voice conversion (SVC). A well-known problem with conventional VC frameworks is that acoustic-feature sequences generated by a converter tend to be over-smoothed, resulting in buzzy-sounding speech. This is because a particular form of similarity metric or distribution is assumed for parameter training of the acoustic model, so that a generated feature sequence that fits the training target examples on average is considered optimal. This over-smoothing occurs as long as a manually constructed similarity metric is used. To overcome this limitation, our proposed SVC framework uses a similarity metric implicitly derived from a generative adversarial network, enabling the distance to be measured in a high-level abstract space. This enables the model to mitigate the over-smoothing problem caused in the low-level data space. Furthermore, we use convolutional neural networks to model long-range context dependencies. This also gives the similarity metric a shift-invariant property, thus making the model robust against misalignment errors in the parallel data. We tested our framework on a non-native-to-native VC task. The experimental results revealed that the use of the proposed framework had a certain effect in improving naturalness, clarity, and speaker individuality.

#11 A Mouth Opening Effect Based on Pole Modification for Expressive Singing Voice Transformation

Authors: Luc Ardaillon ; Axel Roebel

Improving expressiveness in singing voice synthesis systems requires performing realistic timbre transformations, e.g. for varying voice intensity. In order to sing louder, singers tend to open their mouth more widely, which changes the vocal tract’s shape and resonances. This study shows, by means of signal analysis and simulations, that the main effect of mouth opening is an increase of the first formant’s frequency (F1) and a decrease of its bandwidth (BW1). From these observations, we then propose a rule for producing a mouth opening effect by modifying F1 and BW1, and an approach to apply this effect to real voice sounds. This approach is based on pole modification, changing the AR coefficients of an estimated all-pole model of the spectral envelope. Finally, listening tests have been conducted to evaluate the effectiveness of the proposed effect.
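
The pole-modification step can be sketched roughly as below: pick the complex pole pair associated with F1 in an all-pole (LPC) model and move it to a new frequency and bandwidth. The choice of the lowest-frequency complex pole as the F1 candidate and the target values are assumptions of this sketch; the paper's rule for setting F1 and BW1 is not reproduced.

```python
import numpy as np

def modify_first_formant(lpc, fs, new_f1, new_bw1):
    """lpc: AR coefficients [1, a1, ..., aP] of the spectral envelope;
    fs: sampling rate in Hz; new_f1, new_bw1 in Hz.
    Returns AR coefficients with the F1 pole pair moved."""
    poles = np.roots(lpc)
    upper = np.where(poles.imag > 1e-6)[0]
    i1 = upper[np.argmin(np.angle(poles[upper]))]   # lowest-frequency complex pole
    old = poles[i1]
    radius = np.exp(-np.pi * new_bw1 / fs)          # bandwidth -> pole radius
    angle = 2.0 * np.pi * new_f1 / fs               # frequency -> pole angle
    poles[i1] = radius * np.exp(1j * angle)
    j1 = np.argmin(np.abs(poles - old.conj()))      # its conjugate partner
    poles[j1] = poles[i1].conj()
    return np.real(np.poly(poles))
```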

#12 Siamese Autoencoders for Speech Style Extraction and Switching Applied to Voice Identification and Conversion

Authors: Seyed Hamidreza Mohammadi ; Alexander Kain

We propose an architecture called siamese autoencoders for extracting and switching pre-determined styles of speech signals while retaining the content. We apply this architecture to a voice conversion task in which we define the content to be the linguistic message and the style to be the speaker’s voice. We assume two or more data streams with the same content but unique styles. The architecture is composed of two or more separate but shared-weight autoencoders that are joined by loss functions at the hidden layers. A hidden vector is composed of style and content sub-vectors, and the loss functions constrain the encodings to decompose style and content. We can select an intended target speaker either by supplying the associated style vector or by extracting a new style vector from a new utterance using a proposed style extraction algorithm. We focus on in-training speakers but perform some initial experiments on out-of-training speakers as well. We propose and study several types of loss functions. The experimental results show that the proposed many-to-many model is able to convert voices successfully; however, its performance does not surpass that of the state-of-the-art one-to-one model.

#13 Audio Content Based Geotagging in Multimedia

Authors: Anurag Kumar ; Benjamin Elizalde ; Bhiksha Raj

In this paper we propose methods to extract geographically relevant information from a multimedia recording using its audio content. Our method is primarily based on the fact that urban acoustic environments consist of a variety of sounds. Hence, location information can be inferred from the composition of sound events/classes present in the audio. More specifically, we adopt matrix factorization techniques to obtain the semantic content of a recording in terms of different sound classes. We use semi-NMF to perform audio semantic content analysis using MFCCs. This semantic information is then combined to identify the location of the recording. We show that this semantic-content-based geotagging can perform significantly better than state-of-the-art methods.

#14 Time Delay Histogram Based Speech Source Separation Using a Planar Array

Authors: Zhaoqiong Huang ; Zhanzhong Cao ; Dongwen Ying ; Jielin Pan ; Yonghong Yan

Bin-wise time delay is a valuable clue for forming the time-frequency (TF) mask for speech source separation on a two-microphone array. With widely spaced microphones, however, time delay estimation suffers from spatial aliasing. Although a histogram is a simple and effective way to tackle spatial aliasing, it cannot be directly applied to planar arrays. This paper proposes a histogram-based method to separate multiple speech sources on an arbitrary-size planar array while resisting spatial aliasing. A time delay histogram is first utilized to estimate the delays of multiple sources on each microphone pair. The estimated delays on all pairs are then incorporated into an azimuth histogram by means of a pairwise combination test. From the azimuth histogram, the directions of arrival (DOAs) and the number of sources are obtained. Eventually, the TF mask is determined based on the estimated DOAs. Experiments conducted under various conditions confirm the superiority of the proposed method.
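
One common way to build such a histogram in the presence of spatial aliasing is sketched below (a hedged numpy illustration, not the authors' exact procedure): each TF bin votes for every candidate delay consistent with its 2π-ambiguous phase difference, so the true delays accumulate across frequency while aliased candidates spread out.

```python
import numpy as np

def delay_histogram(X1, X2, freqs, max_delay, n_bins=180):
    """X1, X2: (frames, freqs) complex STFTs of one microphone pair;
    freqs: bin center frequencies in Hz; max_delay: physical limit in seconds.
    Returns (histogram, bin centers); the sign convention of the delay
    depends on the STFT convention used."""
    edges = np.linspace(-max_delay, max_delay, n_bins + 1)
    hist = np.zeros(n_bins)
    phase = np.angle(X1 * np.conj(X2))              # (frames, freqs)
    for fi, f in enumerate(freqs):
        if f <= 0:
            continue
        k_max = int(np.ceil(f * max_delay))
        for k in range(-k_max, k_max + 1):          # all aliased candidates
            tau = (phase[:, fi] + 2.0 * np.pi * k) / (2.0 * np.pi * f)
            valid = np.abs(tau) <= max_delay
            hist += np.histogram(tau[valid], bins=edges)[0]
    return hist, 0.5 * (edges[:-1] + edges[1:])
```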

#15 Excitation Source Features for Improving the Detection of Vowel Onset and Offset Points in a Speech Sequence

Authors: Gayadhar Pradhan ; Avinash Kumar ; S. Shahnawazuddin

The task of detecting the vowel regions in a given speech signal is a challenging problem. Over the years, several works on accurate detection of vowel regions and the corresponding vowel onset points (VOPs) and vowel end points (VEPs) have been reported. A novel front-end feature extraction technique exploiting the temporal and spectral characteristics of the excitation source information in the speech signal is proposed in this paper to improve the detection of vowel regions, VOPs and VEPs. To do so, a three-class classifier (vowel, non-vowel and silence) is developed on the TIMIT database using the proposed features as well as mel-frequency cepstral coefficients (MFCCs). Statistical modeling based on a deep neural network has been employed for learning the parameters. Using the developed three-class classifier, a given speech sample is then force-aligned against the trained acoustic models to detect the vowel regions. The use of the proposed features results in the detection of vowel regions quite different from those obtained with MFCCs. Exploiting the differences between the evidence obtained with the two kinds of features, a technique to combine the evidence is also proposed in order to get a better estimate of the VOPs and VEPs.

#16 A Contrast Function and Algorithm for Blind Separation of Audio Signals

Authors: Wei Gao ; Roberto Togneri ; Victor Sreeram

This paper presents a contrast function and an associated algorithm for blind separation of audio signals. The contrast function is based on second-order statistics and minimizes the ratio between the product of the diagonal entries and the determinant of the covariance matrix. The contrast function can be minimized by batch and adaptive gradient descent methods to formulate a blind source separation algorithm. Experimental results on realistic audio signals show that the proposed algorithm yielded separation performance comparable to benchmark algorithms on speech signals, and outperformed the benchmark algorithms on music signals.
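
By Hadamard's inequality, the determinant of a positive-definite covariance matrix is at most the product of its diagonal entries, with equality exactly when the matrix is diagonal, so minimizing the log of this ratio drives the separated outputs toward decorrelation. A minimal numpy evaluation of such a contrast is given below (the batch/adaptive gradient descent loop is omitted, and the exact formulation in the paper may differ).

```python
import numpy as np

def contrast(W, Cx):
    """W: unmixing matrix; Cx: covariance matrix of the mixtures.
    Returns log(prod diag(Cy)) - log(det(Cy)) for Cy = W Cx W^T,
    which is >= 0 and equals 0 iff the outputs are uncorrelated."""
    Cy = W @ Cx @ W.T
    return np.sum(np.log(np.diag(Cy))) - np.linalg.slogdet(Cy)[1]
```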

#17 Weighted Spatial Covariance Matrix Estimation for MUSIC Based TDOA Estimation of Speech Source

Authors: Chenglin Xu ; Xiong Xiao ; Sining Sun ; Wei Rao ; Eng Siong Chng ; Haizhou Li

We study the estimation of the time difference of arrival (TDOA) under noisy and reverberant conditions. Conventional TDOA estimation methods such as MUltiple SIgnal Classification (MUSIC) are not robust to noise and reverberation due to the distortion in the spatial covariance matrix (SCM). To address this issue, this paper proposes a robust SCM estimation method, called the weighted SCM (WSCM). In the WSCM estimation, each time-frequency (TF) bin of the input signal is weighted by a TF mask which, in the ideal case, is 0 for non-speech TF bins and 1 for speech TF bins. In practice, the TF mask takes values between 0 and 1 that are predicted by a long short-term memory (LSTM) network trained on a large amount of simulated noisy and reverberant data. The use of mask weights significantly reduces the contribution of low-SNR TF bins to the SCM estimation and hence improves the robustness of MUSIC. Experimental results on both simulated and real data show that we significantly improve the robustness of MUSIC by using the weighted SCM.
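
The WSCM itself is straightforward to write down; the numpy sketch below assumes the mask has already been predicted (by the LSTM or otherwise), and the array layout and small regularizer are choices of this sketch rather than the paper's.

```python
import numpy as np

def weighted_scm(stft, mask):
    """stft: (channels, frames, freqs) complex STFT; mask: (frames, freqs)
    speech-presence weights in [0, 1].  Returns one weighted SCM per frequency."""
    C, T, F = stft.shape
    scm = np.zeros((F, C, C), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]                           # (channels, frames)
        w = mask[:, f]
        scm[f] = (X * w) @ X.conj().T / (w.sum() + 1e-8)
    return scm
```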

#18 Speaker Direction-of-Arrival Estimation Based on Frequency-Independent Beampattern

Authors: Feng Guo ; Yuhang Cao ; Zheng Liu ; Jiaen Liang ; Baoqing Li ; Xiaobing Yuan

The differential microphone array (DMA) has become increasingly popular in recent years. In this paper, we derive the relationship between the direction-of-arrival (DoA) and the DMA’s frequency-independent beampatterns. The derivation demonstrates that the DoA can be obtained by solving a trigonometric polynomial. Taking dipoles as a special case of this relationship, we propose three methods to estimate the DoA based on dipoles. However, we find these methods are vulnerable around the axial directions in reverberant environments. Fortunately, they can complement each other owing to their robustness at different angles. Hence, to increase robustness to reverberation, we propose a new approach that combines the advantages of the three dipole-based methods for speaker DoA estimation. Both simulations and experiments show that the proposed method not only outperforms traditional methods for small-aperture arrays but is also much more computationally efficient, as it avoids the spatial spectrum search.

#19 A Mask Estimation Method Integrating Data Field Model for Speech Enhancement

Authors: Xianyun Wang ; Changchun Bao ; Feng Bao

In most approaches based on computational auditory scene analysis (CASA), the ideal binary mask (IBM) is often used for noise reduction. However, the IBM can hardly be estimated exactly. Errors in IBM estimation may greatly violate the smooth evolution of speech, because energy is absent from many speech-dominated time-frequency (T-F) units. To reduce this error, the ideal ratio mask (IRM), obtained by modeling the spatial dependencies of the speech spectrum, is used as the target mask, because a predicted ratio mask is less sensitive to estimation errors than a predicted binary mask. In this paper, we introduce a data field (DF) to model the spatial dependencies of the cochleagram and obtain the ratio mask. Firstly, initial T-F units of noise and speech are obtained from the noisy speech. Then the forms of the noise and speech potentials are calculated. Subsequently, the optimal potentials, which reflect the respective distributions of the potential fields, are obtained via the optimal influence factors of speech and noise. Finally, we exploit the potentials of speech and noise to obtain the ratio mask. Experimental results show that the proposed method achieves better speech quality than the reference methods.

#20 Improved End-of-Query Detection for Streaming Speech Recognition

Authors: Matt Shannon ; Gabor Simko ; Shuo-Yiin Chang ; Carolina Parada

In many streaming speech recognition applications such as voice search it is important to determine quickly and accurately when the user has finished speaking their query. A conventional approach to this task is to declare end-of-query whenever a fixed interval of silence is detected by a voice activity detector (VAD) trained to classify each frame as speech or silence. However, silence detection and end-of-query detection are fundamentally different tasks, and the criterion used during VAD training may not be optimal. In particular, the conventional approach ignores potential acoustic cues such as filler sounds and past speaking rate which may indicate whether a given pause is temporary or query-final. In this paper we present a simple modification to make the conventional VAD training criterion more closely related to end-of-query detection. A unidirectional long short-term memory architecture allows the system to remember past acoustic events, and the training criterion incentivizes the system to learn to use any acoustic cues relevant to predicting future user intent. We show experimentally that this approach improves latency at a given accuracy by around 100 ms for end-of-query detection for voice search.

#21 Using Approximated Auditory Roughness as a Pre-Filtering Feature for Human Screaming and Affective Speech AED

Authors: Di He ; Zuofu Cheng ; Mark Hasegawa-Johnson ; Deming Chen

Detecting human screaming, shouting, and other verbal manifestations of fear and anger is of great interest to security Audio Event Detection (AED) systems. The Internet of Things (IoT) approach allows wide-coverage, powerful AED systems to be distributed across the Internet, but a good feature for pre-filtering the audio is critical to such systems. This work evaluates the potential of detecting screaming and affective speech using Auditory Roughness and proposes a very light-weight approximation method. Our approximation uses a similar amount of Multiple Add Accumulate (MAA) operations compared to short-term energy (STE), and at least 10× less MAA than MFCC. We evaluated the performance of our approximated roughness on the Mandarin Affective Speech corpus and a subset of the YouTube AudioSet for screaming against other low-complexity features. We show that our approximated roughness yields higher accuracy.

#22 Improving Source Separation via Multi-Speaker Representations

Authors: Jeroen Zegers ; Hugo Van hamme

Lately there have been novel developments in deep learning towards solving the cocktail party problem. Initial results are very promising and allow for more research in the domain. One technique that has not yet been explored in the neural network approach to this task is speaker adaptation. Intuitively, information on the speakers that we are trying to separate seems fundamentally important for the speaker separation task. However, retrieving this speaker information is challenging, since the speaker identities are not known a priori and multiple speakers are simultaneously active. There is thus a chicken-and-egg problem. To tackle this, source signals and i-vectors are estimated alternately. We show that blind multi-speaker adaptation improves the results of the network and that (in our case) the network is not capable of adequately retrieving this useful speaker information itself.

#23 Multiple Sound Source Counting and Localization Based on Spatial Principal Eigenvector

Authors: Bing Yang ; Hong Liu ; Cheng Pang

Multiple sound source localization remains a challenging issue due to the interaction between sources. Although traditional approaches can locate multiple sources effectively, most of them require the number of sound sources to be known a priori. However, the number of sound sources is generally unknown in practical applications. To overcome this problem, a spatial principal eigenvector based approach is proposed to estimate the number and the directions of arrival (DOAs) of multiple speech sources. Firstly, a time-frequency (TF) bin weighting scheme is utilized to select the TF bins dominated by a single source. Then, for these selected bins, the spatial principal eigenvectors are extracted to construct a contribution function which is used to simultaneously estimate the number of sources and the corresponding coarse DOAs. Finally, the coarse DOA estimates are refined by iteratively optimizing the assignment of the selected TF bins to each source. Experimental results validate that the proposed approach yields favorable performance for multiple sound source counting and localization in environments with different levels of noise and reverberation.

#24 Subband Selection for Binaural Speech Source Localization

Authors: Girija Ramesan Karthik ; Prasanta Kumar Ghosh

We consider the task of speech source localization using binaural cues, namely the interaural time and level differences (ITD & ILD). A typical approach is to process binaural speech using gammatone filters and calculate frame-level ITD and ILD in each subband. The ITD, ILD and their combination (ITLD) in each subband are statistically modelled using Gaussian mixture models for every direction during training. Given a binaural test speech signal, the source is localized using the maximum likelihood criterion, assuming that the binaural cues in each subband are independent. In this work, we investigate the robustness of each subband for localization and compare their performance against the full-band scheme with 32 gammatone filters. We propose a subband selection procedure using the training data, where subbands are rank-ordered based on their localization performance. Experiments on Subject 003 from the CIPIC database reveal that, for high SNRs, the ITD and ITLD of just one subband centered at 296 Hz are sufficient to yield localization accuracy identical to that of the full-band scheme with a test speech duration of 1 s. At low SNRs, in the case of ITD, the selected subbands are found to perform better than the full-band scheme.
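
A sketch of the per-subband GMM modelling and maximum-likelihood localization using scikit-learn; the number of mixture components, the dictionary layout and the helper names are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(cues_by_direction, n_components=4):
    """cues_by_direction: {direction: (frames, dims) array of frame-level
    ITD/ILD/ITLD cues from one selected subband}.  Fits one GMM per direction."""
    return {d: GaussianMixture(n_components=n_components).fit(X)
            for d, X in cues_by_direction.items()}

def localize(models, test_cues):
    """Maximum-likelihood direction: sum the frame log-likelihoods per direction."""
    scores = {d: gmm.score_samples(test_cues).sum() for d, gmm in models.items()}
    return max(scores, key=scores.get)
```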

#25 Unmixing Convolutive Mixtures by Exploiting Amplitude Co-Modulation: Methods and Evaluation on Mandarin Speech Recordings

Authors: Bo-Rui Chen ; Huang-Yi Lee ; Yi-Wen Liu

This paper presents and evaluates two frequency-domain methods for multi-channel sound source separation. The sources are assumed to couple to the microphones with unknown room responses. Independent component analysis (ICA) is applied in the frequency domain to obtain maximally independent amplitude envelopes (AEs) at every frequency. Due to the nature of ICA, the AEs across frequencies need to be de-permuted. To this end, we seek to assign AEs to the same source solely based on the correlation in their magnitude variation against time. The resulted time-varying spectra are inverse Fourier transformed to synthesize separated signals. Objective evaluation showed that both methods achieve a signal-to-interference ratio (SIR) that is comparable to Mazur et al (2013). In addition, we created spoken Mandarin materials and recruited age-matched subjects to perform word-by-word transcription. Results showed that, first, speech intelligibility significantly improved after unmixing. Secondly, while both methods achieved similar SIR, the subjects preferred to listen to the results that were post-processed to ensure a speech-like spectral shape; the mean opinion scores were 2.9 vs. 4.3 (out of 5) between the two methods. The present results may provide suggestions regarding deployment of the correlation-based source separation algorithms into devices with limited computational resources.